Estimating SARS-CoV-2 infection probabilities with serological data and a Bayesian mixture model

The individual results of SARS-CoV-2 serological tests measured after the first pandemic wave of 2020 cannot be directly interpreted as a probability of having been infected. Moreover, these results are usually returned as a binary or ternary variable, relying on predefined cut-offs. We propose a Bayesian mixture model to estimate individual infection probabilities, based on 81,797 continuous anti-spike IgG tests from Euroimmun collected in France after the first wave. This approach used serological results as a continuous variable, and was therefore not based on diagnostic cut-offs. Cumulative incidence, which is necessary to compute infection probabilities, was estimated according to age and administrative region. In France, we found that a “negative” or a “positive” test, as classified by the manufacturer, could correspond to a probability of infection as high as 61.8% or as low as 67.7%, respectively. “Indeterminate” tests encompassed probabilities of infection ranging from 10.8 to 96.6%. Our model estimated tailored individual probabilities of SARS-CoV-2 infection based on age, region, and serological result. It can be applied in other contexts, if estimates of cumulative incidence are available.

only slightly, after the infection. The proportion of non-responders has been reported to be between 5 and 24%, depending on classification criteria and methodologies 7,8.
Estimating cumulative incidence on the basis of serological data is done by correcting for the sensitivity and specificity of the serological test. This can be done through Bayesian methods, as a means of preserving uncertainty in the sensitivity and specificity estimates 9. Other methodological challenges are linked to the selection of the analysis sample and its representativeness, and to potential biases in the selection of the individuals in whom the sensitivity and specificity of the serological test are calculated, relative to the source population. A spectrum bias is indeed often suspected, as symptomatic individuals are more likely to be detected and therefore recruited to study sensitivity. These symptomatic persons are also more likely to have higher antibody levels 10-13.
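As a point of comparison, the simplest non-Bayesian form of this correction is the classical Rogan-Gladen estimator. The sketch below is illustrative only: the function name and the numerical inputs are assumptions, not values from the study, and it ignores the uncertainty in sensitivity and specificity that the Bayesian methods cited above are meant to preserve.

```python
def rogan_gladen(apparent_prevalence: float, sensitivity: float, specificity: float) -> float:
    """Correct the proportion of positive tests for imperfect sensitivity and specificity."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("the test must perform better than chance")
    corrected = (apparent_prevalence + specificity - 1.0) / denominator
    # Sampling noise can push the raw estimate outside [0, 1], so clamp it.
    return min(1.0, max(0.0, corrected))


# Hypothetical inputs: 6% of tests positive, 90% sensitivity, 98% specificity.
print(f"corrected cumulative incidence: {rogan_gladen(0.06, 0.90, 0.98):.4f}")
```

With a sensitivity estimated in a biased (for example, mostly symptomatic) sample, this estimator inherits the spectrum bias discussed above.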
Mixture models constitute an appealing solution to spectrum bias. Indeed, the distribution of serological results is directly estimated from the sample of people whose infection status is unknown, which represents the target population. Hence, these models do not rely entirely on a possibly biased sample to estimate sensitivity 14,15. In the case of serosurveys that took place after the first wave of COVID-19, a mixture model can be described as a weighted average of two probability distributions: one distribution for the serological results of infected persons, and one distribution for the serological results of uninfected persons. The weight associated with the infected persons is the cumulative incidence. These models are however prone to identification issues, which arise when more than one set of parameter values is consistent with the data. This happens notably when the two distributions overlap 14,16.
Finally, estimating the probability of infection for a given individual can be enhanced by considering all the relevant information. First, the pre-test probability of infection in one individual corresponds to cumulative incidence. Thus, factors influencing cumulative incidence can be taken into account to modify this pre-test probability. Notably, it has been shown that seroprevalence varied significantly with administrative region and age class after the first wave in France 3. Second, ELISA ODR, when considered as a continuous variable, varies within the categories of the discrete variable ("negative", "indeterminate", or "positive"). Hence, returning this ternary variable instead of the continuous ELISA ODR results in a loss of information. Indeed, modeling ELISA ODR as a continuous variable has been shown to outperform modeling it as a discrete variable in terms of bias and error 15.
The main objective of this study was to propose a mixture model for estimating tailored individual infection probabilities after the first wave of SARS-CoV-2. To do so, we developed the model on French data, considering age and region, and modeling serological results as a continuous variable. We show the importance of not discretizing serological results for individual diagnosis. Our secondary objectives were to quantify the proportion of "non-responders", and to estimate sensitivity and specificity for the serological test according to the manufacturer's cut-offs. We also aimed to refine age-specific infection fatality rate and infection hospitalization rate using cumulative incidence estimates.

Serological data
The data of SAPRIS-SERO, a previously described serosurvey, were used in the present study 4,5,17. SAPRIS-SERO is based on the SAPRIS cohort ("SAnté, Perception, pratiques, Relations et Inégalités Sociales en population générale pendant la crise COVID-19"), which was set up in March 2020 to study epidemiological and social features of the COVID-19 epidemic in France 17. The adult participants of SAPRIS were recruited from three adult cohorts based on the general population:
• NutriNet-Santé is a general population cohort with online follow-up, focusing on nutrition. Of the 170,000 participants included at the start of the study in 2009, 151,122 were still in the cohort in 2020 18.
• CONSTANCES is a general population cohort, set up in 2012, which includes 204,973 adults selected to be a representative sample of the French adult population 19.
• E3N/E4N is a multi-generational adult cohort. It includes 113,000 persons: the women recruited at the start of the study (1990), their children, and the fathers of these children 20.
All participants in these three initial cohorts with regular access to the Internet and still being followed in 2020 were invited to take part in the SAPRIS study, which consisted of self-administered questionnaires during the first wave. These questionnaires notably covered demographic aspects and history of SARS-CoV-2 testing by RT-PCR. A total of 93,610 participants of SAPRIS were over 20, completed the questionnaires, and lived in metropolitan France. These participants were invited to take part in the SAPRIS-SERO study by collecting a dried-blood spot themselves. The samples were sent to a virology laboratory (Unité des virus émergents, Marseille, France) for serological analysis using the commercial ELISA test (Euroimmun, Lübeck, Germany) detecting anti-SARS-CoV-2 IgG directed against the S1 domain of the spike protein. The results of ELISA assays performed using dried-blood spot samples demonstrated a 98.1% to 100% sensitivity and a 99.3% to 100% specificity with conventional serum assays as a standard 21,22. A maximum of one test per participant was performed, and an ELISA result was available for 82,467 participants. Participants reporting a positive RT-PCR test were considered infected.

Hospital and demographic data
The French population structure by 10-year age class and administrative region came from the Insee 2020 census (Institut national de la statistique et des études économiques) 23. The data about COVID-19-related hospitalizations before the 1st of July 2020, by 10-year age class or by region, were obtained from SIVIC, the exhaustive national inpatient surveillance system used during the pandemic 24. The data about general population mortality attributed to COVID-19 before the 1st of July 2020 were obtained from the CépiDc (Centre d'épidémiologie sur les causes médicales de décès) 25.

Model
The statistical analysis was carried out within a Bayesian framework. In the rest of this section, prior distributions are not always explicitly written; when they are not, they are uniform. Serological results, originally expressed as optical density ratios (ODR), were modeled after a logarithmic transformation to be compatible with the use of unbounded probability functions. In the following, P(ELISA) refers to the distribution of log-ODR. I refers to the set of age classes (10-year groups, starting from 20 years, with persons over 90 included in the over 80 group), and J is the set of French administrative regions. The distribution of ELISA log-ODR in the persons whose infection status is unknown, considering an age class i ∈ I and a region j ∈ J, was denoted P(ELISA | i, j). This distribution was modeled as a mixture of the distributions P(ELISA^+) and P(ELISA^-), corresponding to the distributions of ELISA log-ODR in the infected and uninfected individuals, respectively. The proportion of persons having been infected during the first wave (cumulative incidence), given i and j, was written p_{i,j}:

$$P(\mathrm{ELISA} \mid i, j) = p_{i,j}\, P(\mathrm{ELISA}^{+}) + (1 - p_{i,j})\, P(\mathrm{ELISA}^{-})$$

In the uninfected individuals, ELISA log-ODR was modeled with a skew-normal distribution. The distribution of ELISA log-ODR in the infected individuals was itself a mixture of two normal distributions: one distribution for the responders, P(ELISA^R), and one distribution for the non-responders, P(ELISA^NR). The proportion of non-responders was written p_NR. A prior beta distribution for this proportion was specified to imply a prior 95% credible interval (95% CI) ranging from 1% to 40% (and thus covering the 5 to 24% estimates previously reported 7,8).
Cumulative incidence on the logit scale, for an age class i ∈ I and a region j ∈ J, was the sum of a regional intercept, α_j, and of a log-odds ratio of age, β_i, without interaction:

$$\mathrm{logit}(p_{i,j}) = \alpha_j + \beta_i$$

A weakly informative normal prior distribution was specified for the age log-odds ratios (β_i), with mean 0 and standard deviation 1.
The cumulative distribution functions of ELISA log-ODR in the infected and uninfected individuals allowed estimating the sensitivity and specificity of the test as a binary variable, for several cut-offs. Using specificity and sensitivity, we estimated the area under the receiver operating characteristic curve (AUC), and Youden's J statistic. Youden's J statistic is the sum of specificity and sensitivity, minus one.
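The structure described in this section (a population mixture weighted by cumulative incidence, an infected component that is itself a mixture of responders and non-responders, and a logit-linear cumulative incidence) can be sketched in a few lines. This is a minimal illustration, not the authors' Stan code: all function names and parameter values are hypothetical, and the natural logarithm is assumed for the log-ODR scale.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def skew_normal_pdf(x, location, scale, shape):
    """Density of a skew-normal distribution (shape = 0 reduces to a normal)."""
    z = (x - location) / scale
    return (2.0 / scale) * normal_pdf(z, 0.0, 1.0) * normal_cdf(shape * z, 0.0, 1.0)


def inv_logit(v):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def infected_pdf(x, p_nr, responders, non_responders):
    """Infected component: mixture of non-responders and responders (two normals)."""
    return p_nr * normal_pdf(x, *non_responders) + (1.0 - p_nr) * normal_pdf(x, *responders)


def mixture_pdf(x, alpha_j, beta_i, p_nr, responders, non_responders, uninfected):
    """Density of log-ODR for age class i and region j, infection status unknown."""
    p_ij = inv_logit(alpha_j + beta_i)  # cumulative incidence for (i, j)
    return (p_ij * infected_pdf(x, p_nr, responders, non_responders)
            + (1.0 - p_ij) * skew_normal_pdf(x, *uninfected))


# Hypothetical parameter values, chosen only to make the example run.
density = mixture_pdf(
    x=math.log(1.1), alpha_j=-3.0, beta_i=0.2, p_nr=0.15,
    responders=(1.0, 0.6), non_responders=(-1.2, 0.4), uninfected=(-1.8, 0.5, 2.0),
)
print(f"mixture density at log(1.1): {density:.4f}")
```

In a full Bayesian fit, the same density would serve as the likelihood of each observed log-ODR, with priors placed on the parameters.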
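The link between the cumulative distribution functions and test accuracy at a given cut-off can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the natural logarithm is assumed for the log-ODR scale, and plain normal distributions with made-up parameters stand in for the fitted distributions.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def accuracy_at_cutoff(cutoff_odr, cdf_infected, cdf_uninfected):
    """Return (sensitivity, specificity, Youden's J) for a cut-off on the ODR scale."""
    x = math.log(cutoff_odr)                 # cut-off on the log-ODR scale
    sensitivity = 1.0 - cdf_infected(x)      # infected individuals above the cut-off
    specificity = cdf_uninfected(x)          # uninfected individuals below the cut-off
    return sensitivity, specificity, sensitivity + specificity - 1.0


# Stand-in distributions with hypothetical parameters (not the fitted ones).
def cdf_infected(x):
    return normal_cdf(x, 1.0, 0.8)


def cdf_uninfected(x):
    return normal_cdf(x, -1.5, 0.5)


for cutoff in (0.8, 1.1):
    se, sp, j = accuracy_at_cutoff(cutoff, cdf_infected, cdf_uninfected)
    print(f"cut-off {cutoff}: sensitivity={se:.3f}, specificity={sp:.3f}, J={j:.3f}")
```

Raising the cut-off can only lower sensitivity and raise specificity, which is why the two manufacturer cut-offs give different accuracy profiles.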
A potential decay of ELISA log-ODR over time was assessed with a frequentist linear regression in RT-PCR positive participants.

Infection probability given ELISA ODR as continuous variable
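The probability p_{x,i,j} of having been infected given an ELISA ODR value x, an age group i, and a region j, was computed using Bayes' rule. With P(x | infected) and P(x | uninfected) being the probability densities of the ELISA ODR value x in the infected and uninfected groups, respectively,

$$p_{x,i,j} = \frac{p_{i,j}\, P(x \mid \mathrm{infected})}{p_{i,j}\, P(x \mid \mathrm{infected}) + (1 - p_{i,j})\, P(x \mid \mathrm{uninfected})}$$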
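The Bayes' rule computation described above amounts to a one-line update of the pre-test probability (the cumulative incidence for the person's age group and region). The sketch below uses a hypothetical function name and made-up density values.

```python
def infection_probability(cumulative_incidence, density_infected, density_uninfected):
    """Posterior probability of infection given the densities of the observed ODR value."""
    numerator = cumulative_incidence * density_infected
    denominator = numerator + (1.0 - cumulative_incidence) * density_uninfected
    if denominator == 0.0:
        raise ValueError("the observed value has zero density under both groups")
    return numerator / denominator


# Hypothetical inputs: the same ODR value, in a high- and a low-incidence stratum.
for prior in (0.10, 0.02):
    posterior = infection_probability(prior, density_infected=0.30, density_uninfected=0.02)
    print(f"pre-test probability {prior:.2f} -> infection probability {posterior:.3f}")
```

The same serological value therefore yields different infection probabilities in strata with different cumulative incidence.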

Model comparison
We compared our model with alternative models using an approximation of leave-one-out cross-validation based on Pareto-smoothed importance sampling (PSIS-LOO) 26. PSIS-LOO provides a Bayesian leave-one-out estimate of the expected log pointwise predictive density, with higher values indicating a better model for prediction. We used PSIS-LOO to assess the role of the non-responders component. We also compared the skew-normal distribution of P(ELISA^-) with a normal distribution, and assessed the contribution of age and location to the fit.

Post-stratified cumulative incidence and infection-outcome rates (external validity)
Cumulative incidence was reconstructed at the scale of age groups, regions, and at the scale of metropolitan France, to validate our model in the light of previously published studies. To correct for differences in age and geographical structures between the French population and the SAPRIS-SERO cohort, age-specific cumulative incidences were reconstructed by post-stratification from the p_{i,j} terms, considering the population size pop_{i,j}:

$$p_i = \frac{\sum_{j \in J} p_{i,j}\, \mathrm{pop}_{i,j}}{\sum_{j \in J} \mathrm{pop}_{i,j}}$$
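Post-stratification as described above is a population-weighted average of the stratum-level cumulative incidences. The sketch below is an illustration under assumed data structures (dictionaries keyed by age class and region); the names and numbers are hypothetical.

```python
def post_stratify(incidence, population, age_class=None):
    """Population-weighted cumulative incidence.

    incidence and population are dicts keyed by (age_class, region).
    If age_class is given, only the strata of that age class are averaged;
    otherwise all strata are used (whole-population estimate).
    """
    cells = [key for key in incidence if age_class is None or key[0] == age_class]
    total_population = sum(population[key] for key in cells)
    if total_population == 0:
        raise ValueError("no population in the selected strata")
    return sum(incidence[key] * population[key] for key in cells) / total_population


# Hypothetical strata: two age classes in two regions.
incidence = {("20-29", "A"): 0.10, ("20-29", "B"): 0.02,
             ("30-39", "A"): 0.12, ("30-39", "B"): 0.03}
population = {("20-29", "A"): 1000, ("20-29", "B"): 3000,
              ("30-39", "A"): 2000, ("30-39", "B"): 2000}

print(f"age 20-29: {post_stratify(incidence, population, '20-29'):.3f}")
print(f"overall:   {post_stratify(incidence, population):.3f}")
```

Weighting by census population rather than by cohort counts is what corrects for the cohort's age and regional structure.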

Algorithm and software
The data management used the R software version 4.2.3, and the modeling was done with the Stan software, which implements Hamiltonian Monte Carlo (R package cmdstanr version 2.32.0) 27,28.
We encountered identification issues when fitting the mixture model, in the form of high R-hat statistics, low effective sample sizes, and abnormal trace plots. We overcame these issues with a sequential approach. First, we estimated the distribution of ELISA ODR in infected individuals separately in a first model (319 persons with positive RT-PCR, see below). Second, we plugged the mean parameter estimates of this first model as data into the main model (this approach is called the plug-in principle) 29,30. When computing sensitivity, AUC, and infection probability in the main model, uncertainty in the distribution of ELISA ODR in infected individuals was partially restored. This was done by drawing a set of parameters from a multi-normal approximation of the posterior distribution of the first model for each MCMC iteration of the main model.
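For readers unfamiliar with the R (R-hat) convergence statistic mentioned above, the sketch below computes the classical Gelman-Rubin version from several chains of draws. It is a simplified illustration with a hypothetical function name: Stan reports a more refined rank-normalized split variant, so values will not match Stan's output exactly.

```python
import statistics


def gelman_rubin_rhat(chains):
    """Classical potential scale reduction factor for one scalar parameter.

    chains is a list of equally long lists of posterior draws, one per chain.
    """
    n_chains = len(chains)
    n_draws = len(chains[0])
    if n_chains < 2 or n_draws < 2 or any(len(c) != n_draws for c in chains):
        raise ValueError("need at least two chains of equal length (two or more draws)")
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    if within == 0.0:
        raise ValueError("within-chain variance is zero")
    between = n_draws * statistics.variance(chain_means)
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return (pooled / within) ** 0.5


# Two chains exploring different regions: a sign of non-identification.
stuck_chains = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
print(f"R-hat for disagreeing chains: {gelman_rubin_rhat(stuck_chains):.2f}")
```

Values well above 1 indicate that chains disagree, which is the typical symptom when mixture components can swap roles or overlap.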

Ethical approval and consent to participate
Ethical approval and written or electronic informed consent were obtained from each participant before enrollment in the original cohort. The SAPRIS-SERO study was approved by the Sud-Mediterranée III ethics committee (approval 20.04.22.74247), and electronic informed consent was obtained from all participants for dried blood spot testing. The study was registered (#NCT04392388). All methods were performed in accordance with the relevant guidelines and regulations.

Participants
All samples were collected between May and November 2020. Supplementary Figure 1 illustrates the timing of serological sampling and the timing of hospitalizations for COVID-19 in France during the first wave and early second wave. Among the total cohort of 82,467 participants with one serological test, 319 had a positive RT-PCR test. These 319 participants constituted the sample with known infection (mean age of 52 years, 29% men, mean elapsed time between the RT-PCR and dried blood sampling of 100 days, with a minimum of 12 days and a maximum of 190 days). After excluding 351 samples of individuals with missing data on administrative region of residence, the sample of participants with undetermined infection status included the remaining 81,797 participants (mean age 58 years, 35% men). No group of participants known to be uninfected was available. The number of observations for each region and each age group is provided in Supplementary Tables 5 and 6.

Distribution of ELISA log-ODR
We did not find a significant decay of ELISA log-ODR over time in RT-PCR positive participants over the study period. The slope of the frequentist linear regression of ELISA log-ODR on the time between RT-PCR and serological testing was −0.03 (95% CI, −0.12 to 0.7, p = 0.56). Supplementary Figure 2 illustrates this result.
The observed ELISA log-ODR distributions are displayed in Fig. 1, along with the distributions inferred by the model.
Among the infected individuals, the proportion of non-responders was estimated to be 14.5% (95% CI, 10.5-19.0%). The posterior estimates of the parameters involved in the distributions of ELISA log-ODR among the infected and uninfected individuals are provided in Supplementary Tables 1-4.
These distributions imply an AUC of 92.3% (95% CI, 90.0% to 94.3%) for the serological test. Estimated sensitivities, specificities and Youden's J statistics for the cut-offs 0.8 and 1.1 (ODR) are displayed in Table 1.

COVID-19 retrospective diagnosis: estimating individual infection probability
The model was used to estimate infection probability at the individual scale in France, accounting for age, location (administrative region), and ELISA ODR as a continuous variable. Figure 2 illustrates how the probability of infection is related to ELISA ODR in two regions and three age groups representing the range of cumulative incidence. We found that a "negative" ELISA ODR (below 0.8) could be associated with an infection probability as high as 61.8% (95% CI, 52.7% to 68.6%), corresponding to an ELISA ODR of 0.8 for a person of 40-49 years living in Île-de-France (the region with the highest cumulative incidence). Conversely, a "positive" ELISA ODR (over 1.1) was compatible with a probability of infection as low as 67.7% (95% CI, 59.1% to 75.2%), corresponding to an ELISA ODR of 1.1 for a person over 80 living in Bretagne (the region with the lowest cumulative incidence). The "indeterminate" category (ODR from 0.8 to 1.

Model comparison
In the first model (distribution of ELISA log-ODR in the infected individuals), the PSIS-LOO estimate decreased from −437 to −449 when replacing the two-component distribution of ELISA log-ODR in the infected individuals with a single skew-normal distribution. The PSIS-LOO estimate decreased from −437 to −478 when using a single normal distribution.
In the main model, the PSIS-LOO estimate decreased from −39433 to −39605 when removing administrative region, from −39433 to −40255 when removing age, and from −39433 to −41779 when replacing the skew-normal distribution of ELISA log-ODR in the uninfected individuals with a normal distribution.

Discussion
We used a Bayesian mixture model to produce individual infection probability estimates in the context of the first wave of the SARS-CoV-2 pandemic in France. We showed that when considering age, region, and ELISA ODR as a continuous variable, each of the three categories of the manufacturer's classification covered a wide range of infection probabilities. Using the distributions of ELISA log-ODR inferred by the model, we found a sensitivity of 75.9% for the 1.1 cut-off, which is below the 91.4% previously reported 6. Specificity was high, even for the 0.8 cut-off (99.8%), in line with previous studies 6. Among the infected individuals, the model estimated a proportion of non-responders of 14.5% (95% CI, 10.5-19.0%), in accordance with previous studies 7,8.
The model's cumulative incidence estimates were in accordance with previously reported seroprevalence (about 5% in the whole country, and 10% in the most affected areas) 3-5,31. Likewise, the highest cumulative incidence between 30 and 49 years that we found was in line with the higher seroprevalence previously reported in these age groups 3. Infection hospitalization rate and infection fatality rate increased exponentially with age in adults, with magnitudes similar to those previously reported 31-35.
Other studies have sought to estimate the probability of SARS-CoV-2 infection, based on serological data in the form of a binary variable. These studies therefore estimated a positive predictive value and a negative predictive value. Based on the 1.1 cut-off, GeurtsvanKessel et al. (2020) showed that the same Euroimmun serological test as used in our study had a positive predictive value ranging from 84% to 100%, for a cumulative incidence ranging from 4% to 95%, respectively 36. For the same interval of cumulative incidence, the test had a negative predictive value ranging from 22% to 99%. Using different serological tests and under varying prevalence, Brownstein and Chen (2021) showed that the proportion of positive tests being false ranged from 3% to 88%, while the proportion of negative tests being false remained below 10% 37.
Several studies have used mixture models in the context of the SARS-CoV-2 pandemic. Their objectives were to estimate cumulative incidence without relying on previously reported sensitivity and specificity, notably to account for a possible spectrum bias. However, these studies did not use the model to generate individual-level probabilities of infection 14,15,38. Bottomley et al. (2021) used a normal distribution for the uninfected individuals and a skew-normal distribution for the infected individuals 14. In the context of our data, we found that a skew-normal distribution was more suitable to model the distribution of ELISA log-ODR in the uninfected individuals. The presence of the non-responders in the model improved the fit, as quantified by PSIS-LOO.
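The dependence of predictive values on cumulative incidence, discussed above for tests reported as a binary variable, follows directly from Bayes' rule. The sketch below uses a hypothetical function name and illustrative accuracy values that are not taken from the cited studies.

```python
def predictive_values(sensitivity, specificity, cumulative_incidence):
    """Return (positive predictive value, negative predictive value) of a binary test."""
    true_pos = sensitivity * cumulative_incidence
    false_pos = (1.0 - specificity) * (1.0 - cumulative_incidence)
    false_neg = (1.0 - sensitivity) * cumulative_incidence
    true_neg = specificity * (1.0 - cumulative_incidence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Illustrative accuracy values (assumed), evaluated at several cumulative incidences.
for incidence in (0.04, 0.20, 0.95):
    ppv, npv = predictive_values(0.90, 0.99, incidence)
    print(f"incidence {incidence:.2f}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

A binary result yields only two possible probabilities per stratum, whereas treating ELISA ODR as continuous gives a probability for each observed value.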
Several modeling assumptions were made. First, the distribution of ELISA log-ODR in the infected individuals did not take age into account. Similarly, the decrease of antibody levels with time was not modeled. Indeed, the waning of anti-spike 1 IgG was reported to be weak in the year after a natural SARS-CoV-2 infection, and the time between infection and testing could not exceed nine months in the current study 39. When studying RT-PCR positive participants, we did not find a significant decrease in ELISA log-ODR over time.
Another limitation was due to identification issues, which are common in mixture models 16. To overcome these identification issues, we estimated the distribution of ELISA log-ODR in the infected individuals based only on RT-PCR positive participants. Bottomley et al. (2021) used a similar approach, estimating some parameters in pre-COVID-19 samples and fixing these parameters afterward 14. As a consequence, the uncertainty in cumulative incidence was under-estimated. This uncertainty was partially restored when computing sensitivity, AUC and infection probability. This sequential approach, known as the plug-in principle, has a second drawback. Indeed, a spectrum bias, if present, could not be taken into account as the ELISA log-ODR distribution was only estimated from the RT-PCR positive participants.
Our method can also be used to calculate individual probabilities of infection after the first wave outside of France, given an ELISA ODR value and cumulative incidence estimates. An application based on published cumulative incidence estimates in New York City and Connecticut is provided in the Supplementary information file.
In conclusion, the model estimated tailored individual infection probabilities based on age, region, and on a serological test modeled as a continuous variable.
SAPRIS-SERO study: ANR (Agence Nationale de la Recherche, #ANR-10-COHO-06), Fondation pour la Recherche Médicale (#20RR052-00), Inserm (Institut National de la Santé et de la Recherche Médicale, #C20-26). The sponsor and funders facilitated data acquisition but did not participate in the study design, analysis, interpretation or drafting.
Cohorts funding: The CONSTANCES Cohort Study is supported by the Caisse Nationale d'Assurance Maladie (CNAM), the French Ministry of Health, the Ministry of Research, the Institut national de la santé et de la recherche médicale. CONSTANCES benefits from a grant from the French National Research Agency [grant number ANR-11-INBS-0002] and is also partly funded by MSD, AstraZeneca, Lundbeck and L'Oreal. The E3N-E4N cohort is supported by the following institutions: Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation, INSERM, University Paris-Saclay, Gustave Roussy, the MGEN, and the French League Against Cancer. The NutriNet-Santé study is supported by the following public institutions: Ministère de la Santé, Santé Publique France, Institut National de la Santé et de la Recherche Médicale (INSERM), Institut
The Monte Carlo sampling consisted of 6 chains of 2,000 iterations each (including 1,000 warm-up iterations). Trace plots, R-hat statistics and effective Monte Carlo sample sizes provided by Stan were used to assess convergence. Only two MCMC chains were run for PSIS-LOO estimation, due to memory usage. The model's code (in Stan) is provided in Supplementary Code 1, and in a public repository available at https://github.com/bglemain/Refining-COVID-19-retrospective-diagnosis.
1) encompassed highly variable p France = France and having an ELISA ODR of 1.1. In this subsection, we did not consider the estimates of the region Corsica, due to the low count of tests made in this region. An exhaustive interactive table returning infection probability given age, region, and ELISA ODR is provided in the Supplementary media file.